1 Motivation

The COVID-19 pandemic has not only affected global economies in terms of GDP growth, but also on the path of human development. The core foundations of the concept of human development i.e. a healthy and long life has also been affected. Human development is measured using three core indicators which are; life expectancy (at birth,) knowledge (mean years of schooling) and standard of living (Gross National Income per capita.) The Human Development Index is a critique of the ‘economic growth before people’ model of development. In this project I would like to answer the question: How COVID-19 related deaths correlates with the Human Development Index (considering both low and high HDI countries?) This topic peaked our interests as the media has speculated that most developed countries who rank very high on the HDI topped the charts of corona virus deaths worldwide per million population, albeit having a strongest and most efficient healthcare facilities. In contrast, it is believed that developing countries like India and South Africa with middle-low ranked HDI Indices have managed to keep COVID-19 infections and death rates at some of the lowest levels in the world. To check the relevancy of these speculations, I will use the Our World In Data Covid data set to analyse Covid cases against the components of HDI and provide a summary of findings.

2 Analysis Plan

To answer our research question, Firstly, I will perform data retrieval (identifying data sources, defining important variables and relevant formulas for calculating Human Development Index). Secondly, I will perform data processing which includes data transformation, data translation, checking consistency of data, data pre-processing, checking outliers and handling of missing values. Thirdly, I will perform Exploratory Data Analysis, make statistical summaries, visualization of results and correlation plots. Then I will perform a hypothesis testing of HDI vs total deaths in (all countries, both low & high countries, continent wave 1) and make a hypothesis comparison table as well as data normalization. To confirm the hypothesis test result I will perform linear regression for high HDI countries and identify the significant variables which contributes more to covid death.

3 Data retrieval

3.1 Data source Identification:

Two datasets are considered for this analysis, namely OWID (Our World in Data) and UN (United Nations).

Our World In Data (OWID): OWID could be the right data source for this analysis. It auto extracts data from (Johns Hopkins University)JHU and World Health Organisation(WHO) which are in turn getting daily data from respective country health authorities and they are reliable. It has all the required variables for analysis starting from HDI and its attributes and total death per million. It contains totally 131289 observations with 65 variables.

Link: https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv

United Nations (UN): UN data contains all the statistical information specific to country. I have chosen UN data to extract the education index which is not available in the OWID dataset. It contains totally 195 observations with 15 variables.

Link: http://hdr.undp.org/en/data

3.2 Data retrieval/data extraction

OWID data set is extracted from the database and saved as data_Covid. Similarly, UN data set is extracted and saved as un_edu_data

#Extracting the OWID dataset:
if (!dir.exists("data")) { dir.create("data") } 
if (!file.exists("data/owid-covid-data.csv")) 
  download.file("https://covid.ourworldindata.org/data/owid-covid-data.csv",destfile = "data/owid-covid-data.csv")
data_Covid <- read_csv("data/owid-covid-data.csv", 
                       col_types = cols(
                         .default = col_double(),  
                         iso_code = col_character(),  
                         continent = col_character(),
                         location = col_character(),  
                         date = col_date(format = ""), 
                         tests_units = col_character()
                       )) 

#Extracting UN dataset for education index:
if (!file.exists("data/un_edudata.xlsx")) 
  download.file("http://hdr.undp.org/sites/default/files/2020_statistical_annex_all.xlsx",destfile ="data/un_edudata.xlsx",
                mode = "wb")
un_edu_data <- read_xlsx("data/un_edudata.xlsx", sheet = 2, range = "A5:O200")

3.3 Define the variables

Feature selection is one the vital task to perform data analysis and exploration There are various statistical methods included in feature selection. In this project our motive is to focus on the Human Development index and the total death. Hence, I select features which are closely related to this component. The OWID dataset has 65 variables. All the variables are not required for our analysis. I select only few variables here,

gdp_per_capita: Per capita income or total income measures the average income in a given country in a specified year.It is calculated by dividing the country’s total income by its total population size.OWID doesn’t mention the source, but this data point is maintained by World bank. Latest available is 2011 constant USD. The values are in fraction. It ranges from 702 in Burundi/Africa to 116935 in Qatar/Asia. Units are measured in USD. This variable takes the purchasing power parity into account.

Life expectancy: Life expectancy is the average age an individual is expected to live based on statistical average. Central African Republic has low life expectancy of 53.28 and Monaco has highest range of 86.75. The values are in fraction. The observations are extracted from WHO into OWID.

Education Index: It refers to the knowledge received through schooling or instruction and to the institution of teaching as a whole.The values of the index are in fraction. Seychelles has the highest mean years of schooling with HDI rank of 67 and Burkina Faso has the lowest mean years of schooling with HDI rank of 182.

human_development_index: The Human Development Index is a statistic composite index of life expectancy/long-healthy life, education/knowledge, and per capita income/standard of living, which are used to rank human development of the countries. This is calculated in 2 steps from 3 variables for each countries (life expectency / Health, Education/Expected & Mean years of schooling, Standard of living/GDP Per capita PPP). Values of each of the four metrics are first normalized to an index value of 0 to aggregating the four metrics to produce the HDI. HDI is 1 if a country achieves maximum value(well developed) and it is 0 for a country that is at the minimum value

Reference Link 1 : OWID https://ourworldindata.org/human-development-index

Reference Link 2 : United Nations http://hdr.undp.org/en/indicators/137506

Total_death_per_million: This defines the total confirmed deaths per 1, 000,000 people. The values are denoted in fraction. Here, Asia- India has lowest total deaths per million as 0.001 and South America - Peru having highest death rate of 66112.85 per million.

3.4 Define HDI and its components

The Human Development Index (HDI) is a statistic developed and compiled by the United Nations to measure various countries levels of social and economic development. The index is used to follow changes in development levels over time and to compare the development levels of different countries.

The HDI components comprise of:

  1. Education – mean years of schooling and expected years of schooling and measured by adult literacy rate

  2. Life expectancy at birth – as an index of population health and longevity

  3. Gross national income per capita (GDP) – it is a monetary measure of the finished goods in a country at a specific period.

3.5 HDI formula and calculation

HDI is a single index measure which takes into account the 3 components - extended healthy life, exposure to knowledge and a good standard of living. It uses 4 matrices namely,

1. life expectancy at birth

2. Education

  • Expected years of schooling

  • Average years of schooling

3. Gross National Income (GNI) per capita

Calculation is done using two steps:

Step 1: Formulating indices for each of the metrics Taking actual value for a country, global maximum values and global minimum value is evaluated using the formula,

Dimension index = (actual value - minimum value) / (maximum value - minimum value)

Dimension index ranges between 0 and 1, which denotes minimum and maximum value respectively.

Note: While calculating for knowledge, dimension index for expected years of schooling is calculated separately and dimension index for average years of schooling is calculated separately. Finally, arithmetic mean of both the values are taken.

Step 2: Summing up the values of four metrics to generate HDI Having the 3 dimension indeces, now geometric mean of all the 3 values are taken in order to calculate HDI using the following formula,

HDI = (I(Health) * I(Education) * I(Income))^ (1/3)

Link: https://ourworldindata.org/human-development-index#how-is-the-human-development-index-calculated

4 Data processing

4.1 Data Transformation

4.1.1 Data translation

#Selecting only required variables in data set with date filtered as maximum date which takes the updated date.
data_owid <- data_Covid %>% 
  select(continent, location, date, population,  median_age, aged_65_older, aged_70_older, gdp_per_capita, 
        extreme_poverty,cardiovasc_death_rate, diabetes_prevalence, female_smokers, male_smokers,
        handwashing_facilities, hospital_beds_per_thousand, life_expectancy, human_development_index, 
        total_deaths_per_million,total_deaths,iso_code) %>% 
        group_by(location) %>% filter(date == max(date)) %>% # Take the most recent date per country
        group_by() #removing groupby

I select only the required predictor variables with latest date which will help us to achieve our goal of find the linear dependency between HDI and total COVID deaths. Since the raw data is already in csv format, there is no data transformation performed here.

#Checking Number of rows and columns in the dataset:
dim(data_owid)
## [1] 237  20

OWID dataset has 237 observations with 20 variables. UN dataset has 195 observations with 15 variables.

#Summary statistics
summary(data_owid)
##   continent           location              date              population       
##  Length:237         Length:237         Min.   :2021-04-14   Min.   :4.700e+01  
##  Class :character   Class :character   1st Qu.:2021-11-22   1st Qu.:6.812e+05  
##  Mode  :character   Mode  :character   Median :2021-11-22   Median :6.897e+06  
##                                        Mean   :2021-11-17   Mean   :1.352e+08  
##                                        3rd Qu.:2021-11-22   3rd Qu.:3.111e+07  
##                                        Max.   :2021-11-22   Max.   :7.875e+09  
##                                                             NA's   :2          
##    median_age   aged_65_older    aged_70_older    gdp_per_capita    
##  Min.   :15.1   Min.   : 1.144   Min.   : 0.526   Min.   :   661.2  
##  1st Qu.:22.1   1st Qu.: 3.507   1st Qu.: 2.063   1st Qu.:  4025.4  
##  Median :29.6   Median : 6.224   Median : 3.845   Median : 12294.9  
##  Mean   :30.3   Mean   : 8.607   Mean   : 5.431   Mean   : 19092.2  
##  3rd Qu.:38.7   3rd Qu.:13.914   3rd Qu.: 8.607   3rd Qu.: 27012.3  
##  Max.   :48.2   Max.   :27.049   Max.   :18.493   Max.   :116935.6  
##  NA's   :46     NA's   :48       NA's   :47       NA's   :42        
##  extreme_poverty  cardiovasc_death_rate diabetes_prevalence female_smokers 
##  Min.   : 0.100   Min.   : 79.37        Min.   : 0.990      Min.   : 0.10  
##  1st Qu.: 0.625   1st Qu.:172.39        1st Qu.: 5.378      1st Qu.: 1.90  
##  Median : 2.500   Median :244.31        Median : 7.200      Median : 6.30  
##  Mean   :13.848   Mean   :264.76        Mean   : 8.480      Mean   :10.79  
##  3rd Qu.:21.350   3rd Qu.:334.87        3rd Qu.:10.658      3rd Qu.:19.20  
##  Max.   :77.600   Max.   :724.42        Max.   :30.530      Max.   :44.00  
##  NA's   :111      NA's   :47            NA's   :35          NA's   :90     
##   male_smokers   handwashing_facilities hospital_beds_per_thousand
##  Min.   : 7.70   Min.   :  1.188        Min.   : 0.100            
##  1st Qu.:22.60   1st Qu.: 20.482        1st Qu.: 1.300            
##  Median :33.10   Median : 49.691        Median : 2.450            
##  Mean   :32.91   Mean   : 50.789        Mean   : 3.038            
##  3rd Qu.:41.30   3rd Qu.: 82.687        3rd Qu.: 4.050            
##  Max.   :78.10   Max.   :100.000        Max.   :13.800            
##  NA's   :92      NA's   :141            NA's   :65                
##  life_expectancy human_development_index total_deaths_per_million
##  Min.   :53.28   Min.   :0.3940          Min.   :   3.101        
##  1st Qu.:68.33   1st Qu.:0.6030          1st Qu.: 111.438        
##  Median :74.70   Median :0.7400          Median : 582.160        
##  Mean   :73.41   Mean   :0.7225          Mean   : 936.132        
##  3rd Qu.:78.89   3rd Qu.:0.8287          3rd Qu.:1575.392        
##  Max.   :86.75   Max.   :0.9570          Max.   :6022.108        
##  NA's   :17      NA's   :47              NA's   :39              
##   total_deaths       iso_code        
##  Min.   :      1   Length:237        
##  1st Qu.:    592   Class :character  
##  Median :   3328   Mode  :character  
##  Mean   : 107891                     
##  3rd Qu.:  18690                     
##  Max.   :5159431                     
##  NA's   :38

Summary provides details about the statistical analysis of each of the features. It provides Minimum value, 1st Quantile, Median, Mean, 3rd Quantile, maximum values and the number of NA values to validate our data set. With respect to our response variable, I can observe that there is huge difference between the minimum and maximum value of the variables as 3.101 and 6006.490 respectively. This shows how widely our data is spread and how extreme our observations are.

#Removing missing values in y variable - total_deaths
data_owid <- data_owid %>% select(continent: iso_code)%>% 
              filter(!is.na(total_deaths_per_million))

I select total_deaths_per_million as our target variable. While checking for the NA values, it shows that there are 39 values which are NA. It is advisable to remove the NA values completely rather then performing analysis with the missing values in the dependent variable which might result in biased output.

#Clean UN education dataset and calculate index variable:
#Rename education variables
names(un_edu_data)[2] <- "location"
names(un_edu_data)[7] <- "expected_yr_Schooling"
names(un_edu_data)[9] <- "mean_yr_Schooling"

#Filter and select only the relevant education variables
un_edu_data <- un_edu_data %>%
            select(location,expected_yr_Schooling,mean_yr_Schooling) 
un_edu_data <- un_edu_data %>%
            filter(!is.na(location)) %>% 
            filter(location !="Country") %>% 
            filter(!is.na(expected_yr_Schooling))
#Computing Education index from UN data set:

#Change col type as numeric for finding average
un_edu_data$expected_yr_Schooling <- as.numeric(as.character(un_edu_data$expected_yr_Schooling))  
un_edu_data$mean_yr_Schooling <- as.numeric(as.character(un_edu_data$mean_yr_Schooling))  
sapply(un_edu_data, typeof)
##              location expected_yr_Schooling     mean_yr_Schooling 
##           "character"              "double"              "double"
#Calculate Education index
un_edu_data <- un_edu_data %>%
  mutate(education_index = (expected_yr_Schooling + mean_yr_Schooling)/2) %>%
  select(location, education_index)
apply(is.na(un_edu_data),2, sum)
##        location education_index 
##               0               0

Education index is calculated by summing up the expected year schooling and mean year schooling and diving it by 2. These are done in the UN data set as I will be using this variable further in our project as it is the component of HDI.

4.2 Data Pre-processing

4.2.1 Checking Outliers

Outlines are extreme values that lies outside the minimum and maximum range. They can be always removed from the data set as they might affect the distribution of the data. It might cause error variance and thereby reduce the performance of the various statistical tests which will be performed further. Additionally, they also change the assumption that the variables are in linear relationship with each other. There are various ways in which outliers can be identified and handled.

#Visualizing outliers with box plot
ggplot(data_owid) +
  aes(x = "", y = total_deaths_per_million) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()

The box plot displays how the variable - total_deaths_per_million is dispersed with five components,

1. Minimum

2. First quartile (Q1)

3. Median

4. Third quartile (Q3)

5. Maximum

From the box plot, I can observe that there is an outlier toward the range of 6000. The minimum observation and the maximum observation in the total death per million lies between 3.101 to 6006.490, with median towards the 1st quartile. This shows that the data points are right skewed. This means a wider range of observations of total deaths per million are more spread between the median and the 3rd quartile. Moreover, the data are condensed towards the 1st quartile and the median.

Statistical test for detecting outliers:

There are three formal techniques which is used to detect outliers.

1. Grubbs’s test: This test identifies if the lower observation or the highest observation is an outlier in a dataset

2. Dixon’s test: This is similar to Grubb’s test. Once the outlier is identified, this test will be performed on the identified outlier.

3. Rosner’s test: This test can be performed to identify multiple outlier at the same time, which Grubbs’s test and Dixon’s test fails to perform. This test also has an advantage to avoiding masking, where an outlier which lies close to another outlier can go unidentified.

#Rosner’s test - total_deaths_per_million
data_owid_death_out <- rosnerTest(data_owid$total_deaths_per_million,
                                  k = 3)
data_owid_death_out
## $distribution
## [1] "Normal"
## 
## $statistic
##      R.1      R.2      R.3 
## 5.040462 3.261557 3.127304 
## 
## $sample.size
## [1] 198
## 
## $parameters
## k 
## 3 
## 
## $alpha
## [1] 0.05
## 
## $crit.value
## lambda.1 lambda.2 lambda.3 
## 3.602505 3.600981 3.599448 
## 
## $n.outliers
## [1] 1
## 
## $alternative
## [1] "Up to 3 observations are not\n                                 from the same Distribution."
## 
## $method
## [1] "Rosner's Test for Outliers"
## 
## $data
##   [1]  184.886  161.427 1061.284  135.062 1680.585   50.982 1185.074 2551.911
##   [9] 2462.158  258.075   76.314 1331.627  748.972 1690.543  797.348  168.096
##  [17]  733.382  525.794 2292.661 1392.885   12.931    3.847 1612.449 3771.765
##  [25] 1007.826 2863.555  217.425 3988.455   12.327    3.101  171.422   65.016
##  [33]  777.006  622.886   20.529   10.346 1985.649    3.210 2499.481  168.832
##  [41]   59.926 1413.490   25.985 2542.105  732.936  659.595 3012.154   11.951
##  [49]  486.471  185.592  484.953  382.519 1851.751  191.745  576.206  117.250
##  [57]   15.272 1317.549 1064.511   56.780 1855.680 1861.777  769.743  226.734
##  [65] 1788.712  120.237  137.519 2902.929 1185.178   38.069 1680.207 1769.677
##  [73]  868.445   28.673   72.439 1236.194   62.296 1489.530 1032.794   28.201
##  [81] 3443.164   99.021  334.537  520.130 1517.757  575.191 1125.649  880.099
##  [89] 2207.265  793.351  145.537 1107.895  930.826   96.898 1673.293  569.243
##  [97]  411.415   18.023 2124.874 1277.117  306.614   55.403  774.588 1594.604
## [105] 2433.954   55.887  342.962 1357.878   33.911  117.266  917.221  456.201
## [113]   28.481  899.051  171.933  188.468 2245.655 2204.012  910.931  588.115
## [121] 3587.288  395.343   60.255  347.298 1379.407  387.533 1132.062    7.809
## [129]   31.481    9.789   14.068 1942.723 3576.674  183.327   74.803  787.422
## [137]  127.300  915.034 1678.389   58.011 2266.456 6022.108  425.838 2138.530
## [145] 1803.613  208.495 2881.308 1784.082  101.005  522.915 1513.007  656.068
## [153] 2734.490  250.712  249.741  109.500 1641.697 1263.775   14.862  122.305
## [161] 2548.928 2443.808   80.932 1492.022 2713.224   64.867   11.686 1878.952
## [169]  658.594   69.095 1946.610 1487.181 1308.819  148.230   35.548   12.821
## [177]   11.821  292.148   90.782   28.662 1408.748 2123.618  884.673   69.095
## [185] 1992.669  214.591 2117.286 2319.943  863.391 1754.586   40.606    3.180
## [193]  177.252  243.978  655.169   63.593  193.809  311.353
## 
## $data.name
## [1] "data_owid$total_deaths_per_million"
## 
## $bad.obs
## [1] 0
## 
## $all.stats
##   i   Mean.i      SD.i    Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 936.1319 1009.0297 6022.108     142 5.040462   3.602505    TRUE
## 2 1 910.3147  943.7640 3988.455      28 3.261557   3.600981   FALSE
## 3 2 894.6099  920.0112 3771.765      24 3.127304   3.599448   FALSE
## 
## attr(,"class")
## [1] "gofOutlier"
data_owid_death_out$all.stats

The result of Rosner’s test states that there is one outlier identified in lines 142 which has a value of 6006.490 with a mean of 901.2832 and standard deviation of 970.2075.

data_owid_total_deaths_mill <- data_owid[order(-data_owid$total_deaths_per_million),] 
head(data_owid_total_deaths_mill,2)

The above tables states that Peru has the highest number of death which is 6006.490 and Tanzania has lowest death of 3.101

#Removing outliers from total_deaths_per_million:
data_owid <- data_owid %>% filter(total_deaths_per_million < 4000)
#Visualize the log transform of the total death variable:
ggplot(data_owid) +
  aes(x = total_deaths_per_million) +
  geom_histogram(bins = 30L, fill = "#0c4c8a") +
  theme_minimal() +
  xlab("Total Number of Deaths per million")+
  ggtitle('Histogram of Logarithm of Total Number of Deaths per million')

The above histogram illustrates that the outlier is removed.

4.2.2 Handling missing values

Missing values: Missing values are observation which has no data. Processing it without taking care of it might lead to poor performance of the statistical test and machine learning models. It might lead to more biased variable leading to inaccurate results.

Missing values are of 3 types.

1. Missing completely at random (MCAR) : There is no relationship between the observed and unobserved values which are missing.

2. Missing at random : There is a relationship between the observed and unobserved values as there exist no pattern between them.

3. Missing not at random : The missing data is related to the unobserved data. 4. Structurally missing data: There is a logical reason behind the non-existance of data.

Approaches to handle missing data:

1. Deletion methods to remove missing values.

2. Regression analysis to systematically eliminate data.

3. Data imputation techniques.

#Check NA values in the dataset:
apply(is.na(data_owid),2, sum)
##                  continent                   location 
##                         12                          0 
##                       date                 population 
##                          0                          0 
##                 median_age              aged_65_older 
##                         18                         20 
##              aged_70_older             gdp_per_capita 
##                         19                         18 
##            extreme_poverty      cardiovasc_death_rate 
##                         75                         17 
##        diabetes_prevalence             female_smokers 
##                         14                         56 
##               male_smokers     handwashing_facilities 
##                         57                        104 
## hospital_beds_per_thousand            life_expectancy 
##                         33                         12 
##    human_development_index   total_deaths_per_million 
##                         16                          0 
##               total_deaths                   iso_code 
##                          0                          0
#Checking the percentage of the missing values:
p <- function(x) {sum(is.na(x))/length(x)*100}
apply(data_owid, 2, p)
##                  continent                   location 
##                   6.091371                   0.000000 
##                       date                 population 
##                   0.000000                   0.000000 
##                 median_age              aged_65_older 
##                   9.137056                  10.152284 
##              aged_70_older             gdp_per_capita 
##                   9.644670                   9.137056 
##            extreme_poverty      cardiovasc_death_rate 
##                  38.071066                   8.629442 
##        diabetes_prevalence             female_smokers 
##                   7.106599                  28.426396 
##               male_smokers     handwashing_facilities 
##                  28.934010                  52.791878 
## hospital_beds_per_thousand            life_expectancy 
##                  16.751269                   6.091371 
##    human_development_index   total_deaths_per_million 
##                   8.121827                   0.000000 
##               total_deaths                   iso_code 
##                   0.000000                   0.000000

There is a general assumption that 25 to 30% of missing values are allowed. it is advisable to discard it if the variable is insignificant.

#Treating NA values in continent variable:
data_owid <- data_owid %>% select(continent: iso_code) %>% 
  group_by(location) %>% filter(date == max(date)) %>% # Take the most recent date per country
  group_by() #%>% # Remove the grouping
data_owid <- data_owid %>% 
  filter(!is.na(continent))

Comparing the NA values of continent with the location, it can be observed that the location has insignificant values named as world, high income, low income. This shows that missing values in the continent variable are Structurally missing data. Hence these values can be dropped for further analysis.

#Treating NA values in HDI variable:
data_owid_hdi <- data_owid %>% select(location, human_development_index)%>% 
  filter(is.na(human_development_index))
data_owid <- data_owid %>% 
  filter(!is.na(human_development_index))

There are 5 countries which has missing values namely Kosovo, Monaco, San Marino, Somalia and Taiwan. Since it was only 5 countries, measures was made to fill the missing value from the UN dataset. However, it was noticed that HDI value is missing for these 5 countries in the UN dataset itself. Hence dropping the NA values and proceeding with the further analysis.

#Treating NA values for median_age, aged_65_older and aged_70_older variables:
#Imputing median_age, aged_65_older and aged_70_older with mean
mean_imputer <- function(x) 
{x = round(replace(x, is.na(x) , mean(x, na.rm = TRUE)), digits=0)}
data_owid <- data_owid %>% mutate_at(5:7, mean_imputer)

#Treating NA values for handwashing_facilities variable:
#Removing handwashing_facilities since it has more than 50% of the data missing
data_owid <- data_owid %>% 
  select(-handwashing_facilities)
head(data_owid,2)
#Treating NA values for female_smokers and male_smokers with median:
median_imputer <- function(x) 
{x = round(replace(x, is.na(x) , median(x, na.rm = TRUE)), digits=0)}
data_owid <- data_owid %>% mutate_at(12:13, median_imputer)
#Treating NA values for extreme_poverty :
data_owid_extPov <- data_owid %>% 
  select(location, extreme_poverty) %>% 
  filter(is.na(extreme_poverty))
data_owid <- data_owid %>% 
  select(-extreme_poverty)

Extreme poverty is an vital element which contribute to the GDP of a country which shows an adverse effect on HDI. The missing dimensions were approached to be filled with the data available in the world bank dataset. While analyzing the missing values in the data set, it was observed that values are missing for both countries with high HDI and low HDI. This scenario can be considered as Missing Missing completely at random (MCAR). Missing values were approached to be filled from the world bank dataset, but it was found that most of the countries does not have extreme poverty value recorded in the world bank itself. Here I decided to drop the variable because treating the missing values with imputation method might cause more bias which is not a wise approach.

#Treating NA values for hospital_beds_per_thousand by imputing median:
data_owid_bed <- data_owid %>% select(location, hospital_beds_per_thousand)%>% 
  filter(is.na(hospital_beds_per_thousand))
data_owid <- data_owid %>% mutate_at(13, median_imputer)

#Treating NA values for gdp_per_capita by imputing mean:
data_owid_gdp <- data_owid %>% select(location, gdp_per_capita,human_development_index) %>% 
  filter(is.na(gdp_per_capita))
data_owid <- data_owid %>% mutate_at(8, mean_imputer)

#Treating missing values for cardiovasc_death_rate anddiabetes_prevalence by imputing median:
data_owid_cardio <- data_owid %>% select(location, cardiovasc_death_rate) %>% 
  filter(is.na(cardiovasc_death_rate))
data_owid_diabetes <- data_owid %>% select(location, diabetes_prevalence) %>% 
  filter(is.na(diabetes_prevalence))
data_owid <- data_owid %>% mutate_at(9:10, mean_imputer)

4.2.3 Joining dataset - UN and OWID

#Replacing the country name of UN dataset with OWID for joining both the datasets 
un_edu_data$location[un_edu_data$location == "Viet Nam"] <- "Vietnam"
un_edu_data$location[un_edu_data$location == "Venezuela (Bolivarian Republic of)"] <- "Venezuela"
un_edu_data$location[un_edu_data$location == "Timor-Leste"] <- "Timor"
un_edu_data$location[un_edu_data$location == "Tanzania (United Republic of)"] <- "Tanzania"
un_edu_data$location[un_edu_data$location == "Syrian Arab Republic"] <- "Syria"
un_edu_data$location[un_edu_data$location == "Syrian Arab Republic"] <- "Syria"
un_edu_data$location[un_edu_data$location == "Korea (Republic of)"] <- "South Korea"
un_edu_data$location[un_edu_data$location == "Russian Federation"] <- "Russia"
un_edu_data$location[un_edu_data$location == "Palestine, State of"] <- "Palestine"
un_edu_data$location[un_edu_data$location == "Moldova (Republic of)"] <- "Moldova"
un_edu_data$location[un_edu_data$location == "Lao People's Democratic Republic"] <- "Laos"
un_edu_data$location[un_edu_data$location == "Iran (Islamic Republic of)"] <- "Iran"
un_edu_data$location[un_edu_data$location == "Hong Kong, China (SAR)"] <- "Hong Kong"
un_edu_data$location[un_edu_data$location == "Eswatini (Kingdom of)"] <- "Eswatini"
un_edu_data$location[un_edu_data$location == "Congo (Democratic Republic of the)"] <- "Democratic Republic of Congo"
un_edu_data$location[un_edu_data$location == "Côte d'Ivoire"] <- "Cote d'Ivoire"
un_edu_data$location[un_edu_data$location == "Cabo Verde"] <- "Cape Verde"
un_edu_data$location[un_edu_data$location == "Brunei Darussalam"] <- "Brunei"
un_edu_data$location[un_edu_data$location == "Bolivia (Plurinational State of)"] <- "Bolivia"

Approach was taken to join the dataset by considering the ISO code. However, there is no ISO code given in the UN data set to map with OWID data set. Hence, I consider country names, but there were country names mismatch in the UN and OWID dataset. In order, to rectify it I replace the country names of the UN dataset to match with the OWID dataset so I can join it to extract the education index for our analysis. There are 18 entries with country names mismatch. Hence, I map them to correct country name.

#Joining UN and OWID datasets for extracting education index:
data_owid <- left_join(data_owid, un_edu_data, by = c('location'))
head(data_owid,2)

Education index is added as a new column to the OWID dataset.

#Final check of missing values:
apply(is.na(data_owid),2, sum)
##                  continent                   location 
##                          0                          0 
##                       date                 population 
##                          0                          0 
##                 median_age              aged_65_older 
##                          0                          0 
##              aged_70_older             gdp_per_capita 
##                          0                          0 
##      cardiovasc_death_rate        diabetes_prevalence 
##                          0                          0 
##             female_smokers               male_smokers 
##                          0                          0 
## hospital_beds_per_thousand            life_expectancy 
##                          0                          0 
##    human_development_index   total_deaths_per_million 
##                          0                          0 
##               total_deaths                   iso_code 
##                          0                          0 
##            education_index 
##                          0

There is no any NA values present in our dataset. Our code is clean now for further analysis.

4.3 Exploratory Data Analysis

4.3.1 Statistical summary

#Summary statistics with SKIM function
skim(data_owid)
Data summary
Name data_owid
Number of rows 180
Number of columns 19
_______________________
Column type frequency:
character 3
Date 1
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
continent 0 1 4 13 0 6 0
location 0 1 4 32 0 180 0
iso_code 0 1 3 3 0 180 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2021-11-22 2021-11-22 2021-11-22 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
population 0 1 43110963.96 154249026.21 38254.00 2827166.00 9691893.50 31839857.25 1.444216e+09 ▇▁▁▁▁
median_age 0 1 30.35 9.05 15.00 22.00 30.00 39.00 4.800000e+01 ▇▆▇▆▆
aged_65_older 0 1 8.69 6.18 1.00 3.00 6.50 14.00 2.700000e+01 ▇▃▂▂▁
aged_70_older 0 1 5.48 4.20 1.00 2.00 4.00 9.00 1.800000e+01 ▇▂▂▂▁
gdp_per_capita 0 1 18647.89 19406.49 661.00 4394.50 13031.50 26481.00 1.169360e+05 ▇▂▁▁▁
cardiovasc_death_rate 0 1 260.74 117.42 79.00 174.75 245.00 327.00 7.240000e+02 ▇▇▃▁▁
diabetes_prevalence 0 1 7.73 3.92 1.00 5.00 7.00 10.00 2.200000e+01 ▅▇▅▁▁
female_smokers 0 1 9.42 9.40 0.00 2.00 6.00 14.00 4.400000e+01 ▇▁▂▁▁
male_smokers 0 1 32.25 12.01 8.00 25.00 31.00 38.00 7.800000e+01 ▃▇▃▁▁
hospital_beds_per_thousand 0 1 2.88 2.25 0.00 1.00 2.00 4.00 1.300000e+01 ▇▃▂▁▁
life_expectancy 0 1 72.77 7.56 53.28 67.12 74.28 78.10 8.486000e+01 ▂▃▅▇▆
human_development_index 0 1 0.72 0.15 0.39 0.60 0.74 0.85 9.600000e-01 ▃▅▅▇▇
total_deaths_per_million 0 1 897.90 945.23 3.10 107.38 572.22 1497.27 3.988460e+03 ▇▂▂▁▁
total_deaths 0 1 27517.99 88306.33 1.00 555.25 3076.00 14850.50 7.723440e+05 ▇▁▁▁▁
education_index 0 1 11.02 2.91 4.28 8.87 11.44 13.23 1.734000e+01 ▂▆▇▇▂

SKIM function provides an overview of the tidy dataset I have for analysis. After pre-processing the OWID dataset is reduced to 180 observations and 19 variables. This overview represents clearly the column type and its frequency in our dataset. There are 3 column types namely character, date and numeric. The Variable type and their characters are also described here with the data distribution in a series of histogram for each variable separately.

4.3.2 Visualization

#World map of country with their Total death:
l <- list(color = toRGB("black"), width = 0.5) # light black boundaries
# specify map projection/options
g <- list(
  showframe = FALSE,
  showcoastlines = FALSE,
  projection = list(type = 'Mercator')
)
fig <- plot_geo(data_owid)
fig <- fig %>% add_trace(
  z = ~total_deaths, color = ~total_deaths, colors = 'Reds',
  text = ~location, locations = ~iso_code, marker = list(line = l)
)
fig <- fig %>% colorbar(title = 'Covid Death - count', tickprefix = '')
fig <- fig %>% layout(
  title = 'Covid Death - Worldwide',
  geo = g
)
fig

The world map portrays the total death cases in each of the country.

Note: I have taken the variable “total death” only for this visualization, in order to show the intensity of the covid death across countries.

#World map of country with thier HDI:
l <- list(color = toRGB("black"), width = 0.5) # light black boundaries
# specify map projection/options
g <- list(
  showframe = FALSE,
  showcoastlines = FALSE,
  projection = list(type = 'robinson')
)
fig <- plot_geo(data_owid)
fig <- fig %>% add_trace(
  z = ~human_development_index, color = ~human_development_index, colors = 'Greens',
  text = ~location, locations = ~iso_code, marker = list(line = l)
)
fig <- fig %>% colorbar(title = 'Human Development Index', tickprefix = '')
fig <- fig %>% layout(
  title = 'Human Development Index - Worldwide',
  geo = g
)
fig

The world map illustrates the HDI of each country.

#Scatterplot Matrices:
scatter_owid <- data_owid[c(8,14,16,15,19)] #Selecting only the required variables
scatter_owid.r <- abs(cor(scatter_owid)) #Getting the correlation
scatter_owid.col <- dmat.color(scatter_owid.r) #Getting colours for variables
scatter_owid.o <- order.single(scatter_owid.r) 
cpairs(scatter_owid, scatter_owid.o, panel.colors=scatter_owid.col, gap=.5,
       main="Scatterplot - Total death per million, HDI and its components" )

The required variables which corresponds to HDI and total death per million is selected to plot the scatter plot. This illustrates how the data is plotted against the other variable. I can clearly see that when there is increase in HDI, life expectancy, and education index there is increase in total death as well.

#scatterplot3d
attach(data_owid) 
## The following object is masked from package:tidyr:
## 
##     population
scatterplot3d(median_age,human_development_index,total_deaths_per_million, 
              pch=16, 
              highlight.3d=TRUE,
              type="h", 
              main="3D Scatterplot",
              phi = 0,
              ticktype = "detailed"
              )

A 3D scatter plot is plotted with numerical variables. Here x and y axis is taken as median_age and human_development_index with respect to our target variable in the z axis which is total_deaths_per_million. Here I can see all the data points are rising with increase in total deaths which also denotes a positive correlation between them. Here we can observe that as median age increases the HDI increases. This also indicates that total death cases increases as the HDI increases.

#Checking the data distribution - quantile-quantile plot
DataExplorer::plot_qq(data_owid[4:10])

DataExplorer::plot_qq(data_owid[11:16])

Here I check the distribution of our data with the help of QQ plot. If the data is normally distributed, the points in the QQ-normal plot lie on a straight diagonal line. According to the qq plot the following variables are normally distributed: Aged_65_older Aged_70_older Cardiovasc_deaths_rates Diabetes_prevalence Life expectancy Human_Development_Index Total_deaths_per_million

#Scatterplot - HDI with total death with respect to each Continent
ggplot(data = data_owid) + 
  geom_point(mapping = aes(x = human_development_index, y = total_deaths_per_million, color = continent))

This scatter plot is plotted to see the relationship of total_deaths_per_million and human_development_index with respect to each continent. Here I can observe that Europe has high HDI and it has contributed to more death. In contrast, I have Africa which has low HDI and low death. Continent like Asia, South America, North America and Oceania are moderate in HDI and total death.

#Top 5 countries with high HDI and low 5 countries with less HDI
#select relevant variables for plotting
Owid_data_Graph <- data_owid %>% 
  select(iso_code,location,median_age,aged_70_older,cardiovasc_death_rate,diabetes_prevalence,life_expectancy,gdp_per_capita,education_index,human_development_index,total_deaths_per_million)
OWiD_hdi <- Owid_data_Graph %>%
  select(location,human_development_index)
#Select top5 and bottom 5 countries by HDI
OWiD_high_hdi <- OWiD_hdi  %>% top_n(5)
## Selecting by human_development_index
OWiD_low_hdi <- OWiD_hdi  %>% top_n(-5)
## Selecting by human_development_index
#Combine top and bottom HDI countries
OWiD_high_low<-bind_rows(OWiD_high_hdi, OWiD_low_hdi)
#Color for bar chart
coul <- brewer.pal(5, "Set2") 
barplot(OWiD_high_low$human_development_index,
        main = "Top 5 and low 5 HDI Countries",
        ylab = "HDI",
        xlab = "Countries",
        ylim=c(0,1),
        names.arg = OWiD_high_low$location,
        col = coul,
        width=c(1),
        density=c(20),
        las=1,
        horiz = FALSE
)

The above bar plot illustrates the countries which has high HDI and low HDI. Hong Kong, Iceland, Ireland, Norway and Switzerland are countries with high HDI. Burundi, Central African Republic, Chad, Niger and South Sudan are countries with low HDI.

#Top 5 countries (HDI components with total death per million)
#High HDI countries, plot all HDI parameters
#Select - Top5 HDI countries
OWiD_all_Param_Top <- Owid_data_Graph[order(-Owid_data_Graph$human_development_index),][1:5,]
#apply log to scale down the index values
OWiD_all_Param_Top$total_deaths_per_million <- round(log(OWiD_all_Param_Top$total_deaths_per_million),2)
OWiD_all_Param_Top$gdp_per_capita <- round(log(OWiD_all_Param_Top$gdp_per_capita),2)
OWiD_all_Param_Top$life_expectancy <- round(log(OWiD_all_Param_Top$life_expectancy),2)
OWiD_all_Param_Top$education_index <- round(log(OWiD_all_Param_Top$education_index),2)
OWiD_all_Param <- OWiD_all_Param_Top  %>%
  select(location,life_expectancy,gdp_per_capita,education_index,total_deaths_per_million)
#convert to Pivot longer for plot
OWiD_all_Param_PL <- OWiD_all_Param %>%
  pivot_longer(cols=-c("location"), names_to = "Index" , values_to = "Log_Index")
ggplot_all_param <- ggplot(data=OWiD_all_Param_PL, aes(x=location, y=Log_Index, fill=Index)) +
  geom_bar(stat="identity", position=position_dodge())+
  geom_text(aes(label=Log_Index), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5)+
  scale_fill_brewer(palette="Paired")+
  scale_fill_manual(values = coul)+
  theme_minimal()
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
ggplot_all_param + ggtitle("Top 5 HDI countries for HDI Parameters")

I plot a bar graph taking country with high HDI to visualize how life_expectancy, gdp_per_capita and education_index care are with respect to the total death. Here we can observe that all the components plotted here are relatively similar, except for Ireland and Switzerland which shows slight increase in deaths despite being comparatively similar to other countries.

Note: Applied log for all variables to fit them into scale for the bar chart visualization.

#Low 5 countries (HDI components with total death per million)
#Select - Low 5 HDI countries
OWiD_all_Param_Top <- Owid_data_Graph[order(Owid_data_Graph$human_development_index),][1:5,]
#apply log to scale down the index values
OWiD_all_Param_Top$total_deaths_per_million <- round(log(OWiD_all_Param_Top$total_deaths_per_million),2)
OWiD_all_Param_Top$gdp_per_capita <- round(log(OWiD_all_Param_Top$gdp_per_capita),2)
OWiD_all_Param_Top$life_expectancy <- round(log(OWiD_all_Param_Top$life_expectancy),2)
OWiD_all_Param_Top$education_index <- round(log(OWiD_all_Param_Top$education_index),2)
OWiD_all_Param <- OWiD_all_Param_Top  %>%
  select(location,life_expectancy,gdp_per_capita,education_index,total_deaths_per_million)
#convert to Pivot longer for plot
OWiD_all_Param_PL <- OWiD_all_Param %>%
  pivot_longer(cols=-c("location"), names_to = "Index" , values_to = "Log_Index")
ggplot_all_param <- ggplot(data=OWiD_all_Param_PL, aes(x=location, y=Log_Index, fill=Index)) +
  geom_bar(stat="identity", position=position_dodge())+
  geom_text(aes(label=Log_Index), vjust=1.6, color="white", 
            position = position_dodge(0.9), size=3.5)+
  scale_fill_brewer(palette="Paired")+
  scale_fill_manual(values = coul)+
  theme_minimal()
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
ggplot_all_param + ggtitle("Low 5 HDI countries for HDI Parameters")

I plot a bar graph taking country with high low to visualize how life_expectancy, gdp_per_capita and education_index care are with respect to the total death. Here we can observe that Central African Republic has contributed to more death even though their other components are relatively similar to other countries and Burundi has less deaths.

Note: Applied log for all variables to fit them into scale for the bar chart visualization.

# Top 5 and low 5 countries with high cardiac death rate and their covid death
#Select - Top 5 HDI countries
OWiD_all_cardiac_top <- Owid_data_Graph[order(-Owid_data_Graph$human_development_index),][1:5,]
#select required variables cardiac_death and covid death
OWiD_all_cardiac_covid_high <- OWiD_all_cardiac_top  %>%
  select(location,cardiovasc_death_rate,total_deaths_per_million,human_development_index)
#Select - bottom 5 HDI countries
OWiD_all_cardiac_bottom <- Owid_data_Graph[order(Owid_data_Graph$human_development_index),][1:5,]
#select required variables cardiac_death and covid death
OWiD_all_cardiac_covid_bottom <- OWiD_all_cardiac_bottom  %>%
  select(location,cardiovasc_death_rate,total_deaths_per_million,human_development_index)
#Combine top and bottom HDI countries
OWiD_all_cardiac_top<-bind_rows(OWiD_all_cardiac_covid_high, OWiD_all_cardiac_covid_bottom)
#Apply log to to fit to the bar chart scale
OWiD_all_cardiac_top$cardiovasc_death_rate <- round(log(OWiD_all_cardiac_top$cardiovasc_death_rate),2)
OWiD_all_cardiac_top$total_deaths_per_million <- round(log(OWiD_all_cardiac_top$total_deaths_per_million),2)
#Convert to Pivot longer for plot
OWiD_all_cardiac_top <- OWiD_all_cardiac_top %>%
  pivot_longer(cols=-c("location"), names_to = "Index" , values_to = "value")
#Horizontal Plot
ggplot(OWiD_all_cardiac_top[order(OWiD_all_cardiac_top$Index, decreasing = T),],
       aes(value,  y=reorder(location, -value), label= round(value,2), fill=factor(Index, levels=c("total_deaths_per_million","cardiovasc_death_rate","human_development_index")))) + 
  geom_bar(stat = "identity") +
  scale_fill_discrete(name = "Category", labels = c("total_deaths_per_million","cardiovasc_death_rate","human_development_index"))+
  scale_x_continuous(breaks=c(2,4,6,8,10,12,14)) +
  ylab("Countries") + 
  ggtitle("Top 5 and bottom 5 HDI countries with corresponding cardiac death rate and covid_death")

Seeing the trend of the top 5 and bottom 5 countries on HDI and their cardiac death and covid death, there is a visible trend which shows,

  1. for the high HDI countries, the cardiac death is relatively low but the covid death is relatively high.

  2. for the low HDI countries, the cardiac death is relatively high but the covid death is relatively low.

Note: Applied log to cardiac death rate and covid death variable to fit them in the bar chart. This chart is to show the the 3 variables and their trend, not the actual value.

# Top 5 and low 5 countries with high diabetes prevalence and their covid death
#Select -  Top 5 HDI countries
OWiD_all_diabetes_top <- Owid_data_Graph[order(-Owid_data_Graph$human_development_index),][1:5,]
#select required variables cardiac_death and diabetes prevalence
OWiD_all_diabetes_top <- OWiD_all_diabetes_top  %>%
  select(location,diabetes_prevalence,total_deaths_per_million,human_development_index)
#Select - bottom 5 HDI countries
OWiD_all_diabetes_bottom <- Owid_data_Graph[order(Owid_data_Graph$human_development_index),][1:5,]
#select required variables cardiac_death and diabetes prevalence
OWiD_all_diabetes_bottom <- OWiD_all_diabetes_bottom  %>%
  select(location,diabetes_prevalence,total_deaths_per_million,human_development_index)
#Combine top and bottom HDI countries
OWiD_all_diabetes<-bind_rows(OWiD_all_diabetes_top, OWiD_all_diabetes_bottom)
#Apply log to to fit to the bar chart scale
OWiD_all_diabetes$total_deaths_per_million <- round(log(OWiD_all_diabetes$total_deaths_per_million),2)
#Convert to Pivot longer for plot
OWiD_all_diabetes <- OWiD_all_diabetes %>%
  pivot_longer(cols=-c("location"), names_to = "Index" , values_to = "value")
#Horizontal Plot
ggplot(OWiD_all_diabetes[order(OWiD_all_diabetes$Index, decreasing = T),],
       aes(value,  y=reorder(location, -value), label= round(value,2), fill=factor(Index, levels=c("diabetes_prevalence","total_deaths_per_million","human_development_index")))) +
  geom_bar(stat = "identity") +
  scale_fill_discrete(name = "Category", labels = c("diabetes_prevalence","total_deaths_per_million","human_development_index"))+
  scale_x_continuous(breaks=c(2,4,6,8,10,12,14,16)) +
  ylab("Countries") + 
  ggtitle("Top 5 and bottom 5 HDI countries with corresponding diabetes prevalance and covid_death")

Diabetes prevalence for high and low HDI countries are not hugely varied. No clear trend been seen between diabetic prevalence and covid death for high and low HDI countries. Hence the covid death may not be related to diabetes prevalence, but need to be analysed with other related parameters.

Note: Applied log to covid death variable to fit all variables in the bar chart. This chart is to show the the 3 variables and their trend, not the actual value.

# Top 5 and bottom 5 countries with Aged-70_older and their median age and death per million
#Select - 5 aged_70_older
OWiD_all_aged_top <- Owid_data_Graph[order(-Owid_data_Graph$human_development_index),][1:5,]
#select required variables cardiac_death and diabetes prevalence
OWiD_all_aged_top <- OWiD_all_aged_top  %>%
  select(location,median_age,aged_70_older,total_deaths_per_million,human_development_index)
#Select - bottom 5 aged_70_older
OWiD_all_aged_bottom <- Owid_data_Graph[order(Owid_data_Graph$human_development_index),][1:5,]
#select required variables cardiac_death and diabetes prevalence
OWiD_all_aged_bottom <- OWiD_all_aged_bottom  %>%
  select(location,median_age,aged_70_older,total_deaths_per_million,human_development_index)
#Combine top and bottom aged_70_older
OWiD_all_aged<-bind_rows(OWiD_all_aged_top, OWiD_all_aged_bottom)
#Apply log to to fit to the bar chart scale
OWiD_all_aged$total_deaths_per_million <- round(log(OWiD_all_aged$total_deaths_per_million),2)
#Convert to Pivot longer for plot
OWiD_all_aged <- OWiD_all_aged %>%
  pivot_longer(cols=-c("location"), names_to = "Index" , values_to = "value")
#Horizontal Plot
ggplot(OWiD_all_aged[order(OWiD_all_aged$Index, decreasing = T),],
       aes(value,  y=reorder(location, -value), label= round(value,2), fill=factor(Index, levels=c("aged_70_older","median_age","total_deaths_per_million","human_development_index")))) + 
  geom_bar(stat = "identity") +
  scale_fill_discrete(name = "Category", labels = c("aged_70_older","median_age","total_deaths_per_million","human_development_index"))+
  scale_x_continuous(breaks=c(10,20,30,40,50,60,70,80)) +
  ylab("Countries") + 
  ggtitle("Top 5 and bottom 5 HDI countries with corresponding aged_70_order and median age and total deaths per million")

Median age and aged_70_older are related variables is a known fact. From this bar chart we can infer that, higher the HDI means Higher the median age and higher the covid death.

Note: Applied log to covid death variable to fit all variables in the bar chart scale. This chart is to show the the 4 variables and their trend, not the actual value.

4.3.3 Correlation

owid_corr <- cor(data_owid[,c(4:16)])
corrplot(owid_corr, method = "circle")

Correlation defines how strong the variables are related to each other. There are 3 types of correlation,

1. Positive correlation

2. Negative correlation

3. Zero correlation

Correlation coefficient is a statistical measure to identify the type of correlation existing within the variables. From the correlation plot. We can observe from the correlation plot that median_age, aged_65_older and aged_70_older are more correlate to each other and also shows a high positive correlation. Female_smokers also shows a high positive correlation with variables related to age. cardiovasc_death_rate shows negative correlation. However, they are low so it can be ignored.

5 Analysis

5.1 Hypothesis testing

Hypothesis testing is a statistical test performed to prove an idea or an assumption by using sample data. The test provides evidence concerning the plausibility of the hypothesis. The motive of the project is to identify the correlation between Human Development Index and the total deaths caused by COVID-19 pandemic. This can be found by performing the hypothesis testing. The selected variables are quantitative data, hence we can perform a Spearman’s correlation test which provides us correlation as well as the direction of the relationship between two continuous variables.

Hypothesis Statement:

H0 (Null hypothesis) : There is no linear relationship between HDI and number of death

H1 (Alternate hypothesis) : There is a linear relationship between HDI and number of death

The H0 and H1 are mutually exclusive, and only one can be true. However, one of the two hypothesis will always be true

It is required to check the normality of the data before a hypothesis is performed.Here I have chosen Shapiro-Wilk normality test for evaluation as they are widely recommended and more powerful than the other tests. Shapiro-Wilk normality testdepends on the correlation between the data and their normal scores.

Shapiro-Wilk normality test

#Normality test - Shapiro-Wilk
shapiro.test(data_owid$human_development_index)
## 
##  Shapiro-Wilk normality test
## 
## data:  data_owid$human_development_index
## W = 0.95424, p-value = 1.416e-05
#p-value =  1.416e-05
shapiro.test(data_owid$total_deaths_per_million)
## 
##  Shapiro-Wilk normality test
## 
## data:  data_owid$total_deaths_per_million
## W = 0.85444, p-value = 4.103e-12
#p-value = 5.283e-12

From the output, the two p-values are less than the significance level of 0.05 implying that the distribution of the data is significantly different from normal distribution, which means the data is not normally distributed.

Hypothesis testing:

  1. HDI vs Total Death - All countries

    I perform Spearman’s correlation test because from the Shapiro-Wilk normality test it can be observed that the data does not follow normal distribution.

hdi_corr_all <- cor.test(data_owid$human_development_index, data_owid$total_deaths_per_million, 
                     method = "spearman", exact = FALSE)
hdi_corr_all
## 
##  Spearman's rank correlation rho
## 
## data:  data_owid$human_development_index and data_owid$total_deaths_per_million
## S = 375134, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.6140481

The p-value here is 2.2e-16, which is less than the significance level alpha = 0.05. From this it can concluded that HDI and total deaths are significantly correlated with a correlation coefficient of 0.6109514. Hereby, we reject H0 and accept H1.

  1. HDI vs Total Death - High HDI and low HDI:

    Assumption : Countries with HDI 0.7 and above are considered to be countries with high HDI Countries with HDI 0.5 and below are considered to be countries with low HDI

#Hypothesis Test - Country with high HDI vs Total Death
#Selecting countries with high HDI
OWiD_high_hdi <- data_owid %>% filter(human_development_index > 0.7)
OWiD_high_hdi
#Spearman's correlation test:
hdi_corr_high <- cor.test(OWiD_high_hdi$human_development_index, OWiD_high_hdi$total_deaths_per_million, 
                     method = "spearman", exact = FALSE)
hdi_corr_high
## 
##  Spearman's rank correlation rho
## 
## data:  OWiD_high_hdi$human_development_index and OWiD_high_hdi$total_deaths_per_million
## S = 226362, p-value = 0.5372
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.05864518

Here p-value = 0.5004 and Correlation coefficient = 0.6403101. Here we fail to reject H0. It can be concluded that there is no linear relation between HDI and number of deaths in countries with high HDI.

  1. HDI vs Total Death - Country with low HDI vs Total Death
#Selecting countries with low HDI
OWiD_low_hdi <- data_owid %>% filter(human_development_index < 0.5)
OWiD_low_hdi
#Spearman's correlation test :
hdi_corr_low <- cor.test(OWiD_low_hdi$human_development_index, OWiD_low_hdi$total_deaths_per_million, 
                     method = "spearman", exact = FALSE)
hdi_corr_low
## 
##  Spearman's rank correlation rho
## 
## data:  OWiD_low_hdi$human_development_index and OWiD_low_hdi$total_deaths_per_million
## S = 233.72, p-value = 0.0002614
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7588028

Here p-value = 0.0003852 and Correlation coefficient = 0.7453452 p value is less than 5%, hence we reject H0 and accepting H1 which states there is linear relation between HDI and total death in countries with low HDI

  1. HDI vs Total Death - Continent Wave 1 - European region
#Extracting data with date of Wave 1
data_owid_wave1 <- data_Covid %>% 
  select(continent, location, date, human_development_index, total_deaths_per_million) %>% 
  group_by(location) %>% filter(date == "2020-06-01") %>% # Take the most recent date per country
  group_by() #removing groupby

Creating another dataset filtering date of 2020-06-01 which denotes the first wave of the pandemic.

#Renaming the variables:
colnames(data_owid_wave1)[6] <- "total_deaths_per_million_Wave1"
head(data_owid_wave1,1)

Rename the column - total_deaths_per_million_Wave as total_deaths_per_million_Wave1 which would be used in further analysis while joining both the dataset - data_owid and data_owid_wave1

#Joining both the dataset
data_owid_wave1 <- left_join(data_owid, data_owid_wave1, by = c('location'))
head(data_owid_wave1,2) 
apply(is.na(data_owid_wave1),2, sum)
##                continent.x                   location 
##                          0                          0 
##                     date.x                 population 
##                          0                          0 
##                 median_age              aged_65_older 
##                          0                          0 
##              aged_70_older             gdp_per_capita 
##                          0                          0 
##      cardiovasc_death_rate        diabetes_prevalence 
##                          0                          0 
##             female_smokers               male_smokers 
##                          0                          0 
## hospital_beds_per_thousand            life_expectancy 
##                          0                          0 
##  human_development_index.x total_deaths_per_million.x 
##                          0                          0 
##               total_deaths                   iso_code 
##                          0                          0 
##            education_index                continent.y 
##                          0                          1 
##                     date.y  human_development_index.y 
##                          1                          1 
## total_deaths_per_million.y 
##                         19

19 countries does not have data as of 2020-06-01. This might be a reason that wave 1 would not have started for these countries as of June 2020.

#Removing NA from total_deaths_per_million.y
data_owid_wave1 <- data_owid_wave1 %>% filter(!is.na(total_deaths_per_million.y))
#Filtering continent - Europe
data_owid_wave1 <- data_owid_wave1 %>% 
  filter(continent.x == "Europe")
#Spearman's correlation test:
hdi_corr_europe <- cor.test(data_owid_wave1$human_development_index.y, data_owid_wave1$total_deaths_per_million.y,
                            method = "spearman", exact = FALSE)
hdi_corr_europe
## 
##  Spearman's rank correlation rho
## 
## data:  data_owid_wave1$human_development_index.y and data_owid_wave1$total_deaths_per_million.y
## S = 6626.5, p-value = 0.002015
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## 0.463047

Here p-value = 0.02015 and Correlation coefficient = 0.463047, here p value is less than 5%, hence we reject H0 and accepting H1 which states there is linear relation between HDI and total death in European countries when first wave occurred.

5.1.1 Hypothesis test - Table and comparison

#Creating Matrix to add correlation result:
table_corr<- matrix(0,4,4) # entry, rows, columns
row.names(table_corr) <- c("All_countries", "Countries_High_HDI ", "Countries_low_HDI ", "Wave_1_Europe")
colnames(table_corr) <- c("P-value","T-test","Correlation", "H0/H1")

#Result for all countries
table_corr[1,1] <- round(hdi_corr_all$p.value,3)
table_corr[1,2] <- round(hdi_corr_all$statistic,3)
table_corr[1,3] <- round(hdi_corr_all$estimate,3)
table_corr[1,4] <- "H1"

#Result for High HDI countries
table_corr[2,1] <- round(hdi_corr_high$p.value,3)
table_corr[2,2] <- round(hdi_corr_high$statistic,3)
table_corr[2,3] <- round(hdi_corr_high$estimate,3)
table_corr[2,4] <- "H0"

#Result for Low HDI countries
table_corr[3,1] <- round(hdi_corr_low$p.value,3)
table_corr[3,2] <- round(hdi_corr_low$statistic,3)
table_corr[3,3] <- round(hdi_corr_low$estimate,3)
table_corr[3,4] <- "H1"

#Result for wave1 - Europe
table_corr[4,1] <- round(hdi_corr_europe$p.value,3)
table_corr[4,2] <- round(hdi_corr_europe$statistic,3)
table_corr[4,3] <- round(hdi_corr_europe$estimate,3)
table_corr[4,4] <- "H1"

table_corr <- as.data.frame(table_corr)

5.1.2 Summarizing the hypothesis results:

table_corr

Plotting the correlation obtained in the hypothesis results:

#human_development_index and total_deaths_per_million - All countries
ggscatter(data_owid, x = "human_development_index", y = "total_deaths_per_million", 
          add = "reg.line", conf.int = TRUE, 
          color = "blue",
          palette = NULL,
          cor.coef = TRUE, 
          cor.method = "spearman",
          xlab = "Human Development Index", ylab = "Deaths per million",
          title = 'Human Development Index and Deaths per million - All countries')
## `geom_smooth()` using formula 'y ~ x'

We can observe from the Spearman’s Correlation test between the HDI and total deaths in all country that H0 is rejected since p-value is less than significance value of 0.05. This brings to a conclusion that both the variables are linearly dependent. Furthermore, by visualizing the result through scatter plot, we can conclude saying the variables are positively correlated. The Spearman’s correlation coefficient r which is 0.61 which again justifies that it is far away from 0 which makes the linear relationship more stronger between the HDI and total deaths. It can be interfered that as HDI increases the total death also increases.

#Plot for High HDI:
ggscatter(OWiD_high_hdi, x = "human_development_index", y = "total_deaths_per_million", 
          add = "reg.line", conf.int = TRUE, 
          color = "red",
          palette = NULL,
          cor.coef = TRUE, 
          cor.method = "spearman",
          xlab = "Human Development Index", ylab = "Deaths per million",
          title = 'Human Development Index and Deaths per million - High HDI countries')
## `geom_smooth()` using formula 'y ~ x'

The Spearman’s Correlation test between the country with high HDI and total deaths concludes that H1 is accepted and we failed to reject H0 since p-value is more than significance value of 0.05. This concludes that both the variables are not linearly dependant. With the visualizing of scatter plot, we can conclude saying the variables are not correlated with each other and they do not project any linear dependencies between them. To makes this statement even stronger, the Spearman’s correlation coefficient r which is 0.064 which displays zero correlation between the two variables.

5.2 Linear Regression - Modelling:

From the hypothesis testing, we can see there is no relation between the total deaths per million and the HDI in countries with high HDI. Hence to identify what are the other contributing factors leading to covid death, I now perform a linear regression between countries with high HDI and covid deaths.

#Feature selection:
#Selecting only required data as per correlation plot:
data_owid <- data_owid %>%
            filter(human_development_index > 0.7) %>%
            select (population,  median_age, gdp_per_capita, cardiovasc_death_rate, 
                    diabetes_prevalence, male_smokers,hospital_beds_per_thousand, life_expectancy, 
                    human_development_index, education_index, total_deaths_per_million)

Select only required variables to see the relationship between death and HDI.

#Normalise the data points:
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
data_owid$population<-normalize(data_owid$population)
data_owid$median_age<-normalize(data_owid$median_age)
data_owid$gdp_per_capita<-normalize(data_owid$gdp_per_capita)
data_owid$cardiovasc_death_rate<-normalize(data_owid$cardiovasc_death_rate)
data_owid$diabetes_prevalence<-normalize(data_owid$diabetes_prevalence)
data_owid$male_smokers<-normalize(data_owid$male_smokers)
data_owid$hospital_beds_per_thousand<-normalize(data_owid$hospital_beds_per_thousand)
data_owid$life_expectancy<-normalize(data_owid$life_expectancy)
data_owid$human_development_index<-normalize(data_owid$human_development_index)
data_owid$education_index<-normalize(data_owid$education_index)

Since the dataset is not normalized as per the qq plot, performing a normality function to normalize data, which is essential for performing linear regression.

#Splitting the data
set.seed(1234)
create_train_test <- function(data, size = 0.8, train = TRUE) {
  n_row = nrow(data)
  total_row = size * n_row
  train_sample <- 1: total_row
  if (train == TRUE) {
    return (data[train_sample, ])
    } else {
      return (data[-train_sample, ])
    }
  }
data_train <- create_train_test(data_owid, 0.8, train = TRUE)
data_test <- create_train_test(data_owid, 0.8, train = FALSE)
dim(data_train)
## [1] 90 11
dim(data_test)
## [1] 23 11

Here the dataset is split into two datasets. 1. Train 2. Test

#Fit, train and predicting the model
fitControl <- trainControl(method = "none")
lm_train_pred <- train(total_deaths_per_million ~., 
                       data = data_train,
                       method = "lm",
                       trControl = fitControl)
lm_train_pred
## Linear Regression 
## 
## 90 samples
## 10 predictors
## 
## No pre-processing
## Resampling: None
lm_pred = predict(lm_train_pred, data_test)
summary(lm_train_pred)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1424.3  -653.3  -139.5   499.0  1826.2 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  2478.3      770.8   3.215  0.00189 ** 
## population                  -1884.8      846.7  -2.226  0.02886 *  
## median_age                   2843.3      678.2   4.192 7.15e-05 ***
## gdp_per_capita              -2850.6      893.8  -3.189  0.00205 ** 
## cardiovasc_death_rate         187.6      838.0   0.224  0.82340    
## diabetes_prevalence         -1006.7      536.4  -1.877  0.06423 .  
## male_smokers                 -547.0      611.7  -0.894  0.37390    
## hospital_beds_per_thousand   -862.3      645.6  -1.336  0.18549    
## life_expectancy             -2105.6     1188.1  -1.772  0.08021 .  
## human_development_index      2149.9     1506.7   1.427  0.15755    
## education_index             -2732.5     1278.6  -2.137  0.03569 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 833.6 on 79 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.2783 
## F-statistic: 4.432 on 10 and 79 DF,  p-value: 5.941e-05

From the linear regression summary it can be confirmed that

a. Cardiac_death_rate and male_Smokers are not statistically significant because their p-values are greater than the usual significance level of 0.05

b. The HDI coefficient in the regression equation is 2137.95. This coefficient represents the mean increase of total deaths per million for every additional one increase in HDI, which means if HDI increases by 0.1, the average total deaths per million increases by 2137.95. It is also interesting to find that median_age shows a high positive correlation with covid death. The coefficient of median age in the regression equation is 2572.14. If median age increases by 1, the average total deaths per million increases by 2572.14.

c. There is a negative coefficient for variables such as gdp_per_capita, diabetes_prevalence, life_expectancy and education_index which suggests that as these variables increases, the covid death decreases.

6 Conclusion

From the overall analysis, identified that there exist a linear relation between HDI and covid death. However, while exploring deep into separate countries with high HDI and low HDI we can observe that, high HDI countries has no relationship with covid death. Hence performing a linear relation to identify what are the other factors contributing to death in countries with high HDI. It can be concluded from the linear model that median age variable (aged population) is one of the prominent factor which contributes to total death, which means high median age contributes to high covid deaths.

7 Critique

Alternative Explanations:

From general observation, another probable reason why countries with high HDI had high death rates is due to the longer life expectancy as compared to low HDI countries. This means High HDI countries have a large population of elederly people which are more prone to risk of contracting diseases due to weakened immune systems thus the corona virus pandemic took a toll on high HDI countries. We can also notice that when vaccines where being administered, high priority was given to the old age since they were more at risk.

Limitations:

1.Initially OWID dataset has 237 observations with 20 variables. However, after cleaning the data with several pre-processing steps and handling all missing values the dataset was reduced to 180 observations and 19 variables which is a small dataset. Around 24% of the data is been removed, which might have an impact on the analysis and statistical output.

2.No scientific methods applied other than mean and median to replace NA values.